Add option to evict keys LRU from the sharded redis tables #3499
pcmoritz merged 18 commits into ray-project:master from
Conversation
```diff
  // The task was not in the GCS task table. It must therefore be in the
  // lineage cache.
- RAY_CHECK(lineage_cache_.ContainsTask(task_id));
+ RAY_CHECK(lineage_cache_.ContainsTask(task_id))
```
This seems to be the narrow waist at which we access evicted lineage, though there could be other sites I'm missing.
```diff
  handler_warning_timeout_ms_(100),
  heartbeat_timeout_milliseconds_(100),
- num_heartbeats_timeout_(100),
+ num_heartbeats_timeout_(300),
```
Raising this to 30s since 10s is too easy to hit with random pauses (e.g., forking process takes a long time, or the kernel stalls compacting hugepages).
Sounds good. It's possible that some of the tests are currently waiting for the full 10s, in which case that will become really slow. If that's the case and we observe that, then we can configure this parameter specifically in those tests.
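For reference, the 10s and 30s figures above follow from multiplying these two constants; a small sketch of that arithmetic (interpreting `num_heartbeats_timeout` as the number of missed heartbeats before a node is considered dead is inferred from the discussion, not from surrounding code):

```python
# Values taken from the diff above; interpretation is an assumption.
heartbeat_timeout_milliseconds = 100
num_heartbeats_timeout_old = 100
num_heartbeats_timeout_new = 300

old_timeout_s = num_heartbeats_timeout_old * heartbeat_timeout_milliseconds / 1000.0  # 10.0 s
new_timeout_s = num_heartbeats_timeout_new * heartbeat_timeout_milliseconds / 1000.0  # 30.0 s
```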
Checked and Ape-X seems to be stable at an aggressive 500MB redis memory limit.
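For context, a minimal usage sketch of such a limit, assuming `ray.init` accepts the new `redis_max_memory` argument as a byte count (the exact API surface is not spelled out in this thread):

```python
import ray

# Hypothetical sketch: cap each non-primary redis shard at 500 MB, the limit
# used for the Ape-X stability check above. Older task and object metadata is
# then evicted LRU once a shard reaches this size.
ray.init(redis_max_memory=500 * 1024 * 1024)
```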
pcmoritz left a comment
This is working as intended for me.
```diff
  # Check that we get warning messages for both raylets.
- wait_for_errors(ray_constants.REMOVED_NODE_ERROR, 2, timeout=20)
+ wait_for_errors(ray_constants.REMOVED_NODE_ERROR, 2, timeout=40)
```
Ah, for this test, can we actually do `internal_config=json.dumps({"num_heartbeats_timeout": 40})` or something like that?
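A rough sketch of the suggested test change; the keyword name (`internal_config` vs. something like `_internal_config`) and how the test harness passes it through are assumptions based on the comment above:

```python
import json

import ray

# Override the raised default (300 heartbeats ~= 30 s) for just this test so it
# does not have to wait out the full timeout; 40 heartbeats ~= 4 s with the
# default 100 ms heartbeat interval, so the wait_for_errors timeout can stay short.
ray.init(internal_config=json.dumps({"num_heartbeats_timeout": 40}))
```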
python/ray/rllib/train.py (Outdated)
```python
default=None,
type=int,
help="--redis-max-memory to pass to Ray."
" This only has an affect in local mode.")
```
Hi @ericl What does "local mode" mean? Does this work in multi-node mode?
It just means that when using a cluster, you need to pass --redis-max-memory to ray start and not train.py
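To illustrate the distinction, a hedged sketch of how the `train.py` flag could be wired up in local mode; the argparse definition mirrors the diff above, while forwarding the value through `ray.init(redis_max_memory=...)` is an assumption about the plumbing:

```python
import argparse

import ray

parser = argparse.ArgumentParser()
parser.add_argument(
    "--redis-max-memory",
    default=None,
    type=int,
    help="--redis-max-memory to pass to Ray."
    " This only has an effect in local mode.")
args = parser.parse_args()

# Local mode: train.py starts Ray itself, so the limit can be forwarded here.
# On a cluster, pass --redis-max-memory to `ray start` on the head node
# instead; this flag is then ignored.
if args.redis_max_memory is not None:
    ray.init(redis_max_memory=args.redis_max_memory)
else:
    ray.init()
```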
For now, the memory flush policy does not support multiple redis shards. I notice that this PR can limit the memory used by each redis shard. If this PR is merged, does it mean we no longer need the redis memory flush, at least to some extent?
@llan-ml that's right, this should supersede redis flushing. I updated the doc page to remove the old flushing documentation.
What do these changes do?
This adds an experimental `redis_max_memory` flag that bounds the redis memory used per data shard. Note that this only applies to the non-primary shards, which store the majority of the task and object metadata; data such as client metadata on the primary shard is never evicted.

Since profiling data has a nested structure and cannot be LRU evicted, profiling is also made controllable via `collect_profiling_data`, and it is disabled when `redis_max_memory` is set.

Analysis of redis's approximate LRU eviction algorithm: https://github.com/antirez/redis/blob/a2131f907a752e62c78ea6bb719daf9fe2f91402/src/evict.c#L118

We use `maxmemory_samples 10`. There is also a persisted eviction pool of 16 entries. This effectively gives us 26 tries per eviction to hit an old key (lower bound). Let's assume the most recent 30% of keys are required for stable operation, and we evict at 10000 QPS. Then a needed key is lost only if all 26 sampled candidates fall in that recent 30%, which works out to well under a 1% chance per year (see the sketch below).

So a lower bound on reliability with approximate LRU eviction is 99% per year. The actual reliability will be much higher, of course, since it is unlikely we need even 30% of the metadata, and the eviction pool is persisted over time.
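To make the 99%-per-year bound concrete, here is the back-of-the-envelope calculation implied above, as a small sketch using the stated assumptions (26 candidates per eviction, the newest 30% of keys must survive, 10000 evictions per second):

```python
# An eviction removes the oldest of the sampled candidates, so a needed
# ("recent") key is evicted only if all 26 candidates fall in the newest 30%.
p_bad_eviction = 0.3 ** 26                     # ~2.5e-14

evictions_per_year = 10000 * 3600 * 24 * 365   # ~3.2e11

# Expected number of needed keys evicted per year; by a union bound this also
# upper-bounds the probability of evicting any needed key within a year.
expected_bad_per_year = p_bad_eviction * evictions_per_year  # ~0.008

# So the chance of losing a needed key is under 1% per year, i.e. a lower
# bound of roughly 99% reliability per year.
```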
TODO:
Related issue number
#3306
#954
#3452